Fix stateless HTTP task accumulation causing memory leak#2145
Open
wiggzz wants to merge 15 commits intomodelcontextprotocol:mainfrom
Open
Fix stateless HTTP task accumulation causing memory leak#2145wiggzz wants to merge 15 commits intomodelcontextprotocol:mainfrom
wiggzz wants to merge 15 commits intomodelcontextprotocol:mainfrom
Conversation
67d02d3 to
a7a340d
Compare
2 tasks
wiggzz
added a commit
to dbt-labs/mcp-python-sdk
that referenced
this pull request
Mar 6, 2026
Use request-scoped task groups for stateless requests instead of the manager's global task group. When a client disconnects while a tool call is in progress, the request-scoped task group is cancelled, preventing zombie tasks from accumulating. In production, this caused 449 leaked ServerSession objects, 66,086 orphaned coroutines, and ~150 MB of leaked heap. Upstream PR: modelcontextprotocol#2145 Github-Issue: modelcontextprotocol#1764
wiggzz
added a commit
to dbt-labs/mcp-python-sdk
that referenced
this pull request
Mar 9, 2026
Use request-scoped task groups for stateless requests instead of the manager's global task group, preventing zombie task accumulation when clients disconnect mid-request. Upstream PR: modelcontextprotocol#2145
wiggzz
added a commit
to dbt-labs/mcp-python-sdk
that referenced
this pull request
Mar 9, 2026
…tprotocol#756) In stateless mode, each request spawned a `run_stateless_server` task into the manager's global `_task_group`. After `handle_request()` completed and `terminate()` was called, the task continued running inside `app.run()`, blocked on `async for message in session.incoming_messages`. These zombie tasks accumulated indefinitely, leaking memory. Replace the global task group spawn with a request-scoped task group so that server tasks are automatically cancelled when their request completes. Add a regression test that simulates the real blocking behavior of `app.run()` using `anyio.sleep_forever()` and verifies no tasks linger in the global task group after requests finish. Closes modelcontextprotocol#756 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntextprotocol#1764 (open) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add anyio.sleep(0) between terminate() and cancel_scope.cancel() so the message router can observe the ClosedResourceError from the closed streams before the task group is cancelled (restores line 1022 coverage) - Remove handle_list_tools from test — it was never called (only call_tool is exercised), so its return was an uncovered line - Mark the CallToolResult return as pragma: no cover — intentionally unreachable in this test because the task is always cancelled at await tool_gate.wait() before the return executes - Collapse multi-line function signatures to satisfy ruff-format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) yield in the stateless request handler allows the event loop to exercise exception paths that were previously unreachable with mocked tests. These paths are legitimately reachable and the more realistic test now covers them consistently: - streamable_http_manager.py: except Exception in run_stateless_server - streamable_http.py: except Exception in SSE pipeline (x2) - streamable_http.py: if not self.mcp_session_id (always true in stateless mode) - session.py: except Exception in receive loop cleanup Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) between terminate() and cancel_scope.cancel() was non-deterministic: sometimes the event loop ran exception handlers (covering the pragmas), sometimes the cancel won the race (leaving lines uncovered). This caused strict-no-cover to flip between "wrongly marked" and "missing coverage" across CI runs. Revert to the simple, deterministic approach: - Restore all original pragma: no cover annotations - Add pragma: no cover for the ClosedResourceError/_terminated path in the message router, which is unreachable with request-scoped task groups (Cancelled always wins over ClosedResourceError) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The if-branch check itself is reachable; it's the body that isn't. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cancel the request-scoped task group first so the Cancelled exception deterministically reaches the server task before terminate() closes the streams. This avoids a race between Cancelled and ClosedResourceError in the message router. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change 5 pragma annotations from strict "no cover" to "lax no cover" since these exception handlers may or may not fire depending on cancellation timing. Fix ruff-format in test file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark 4 test assertions as lax no cover — coverage.py on Python 3.11 falsely reports them as uncovered despite the test passing. Also fix ruff format for assert message and async with blocks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add None check assertion and type: ignore for accessing private _tasks attribute on TaskGroup (needed to verify no leaked tasks). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach ran handle_request inside a child task within the request-scoped task group. This caused coverage.py on Python 3.11 to lose track of execution when returning from the task group back to the caller — the root cause of the false coverage gaps on test assertions. The fix is simpler: run handle_request directly in the async with body (like the original code on main), with only run_stateless_server as a child task. This preserves the request-scoped task group for preventing task accumulation while keeping the caller's execution context intact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Terminate all active transports (stateful and stateless) before cancelling the task group during shutdown. This closes in-memory streams so EventSourceResponse can send a final more_body=False chunk — a clean HTTP close instead of a connection reset. Without this, the request-scoped task groups introduced for the memory leak fix isolate stateless transports from the manager's cancel scope, so nothing ever closes their streams on shutdown.
92d4f67 to
3179143
Compare
wiggzz
added a commit
to dbt-labs/mcp-python-sdk
that referenced
this pull request
Mar 10, 2026
Request-scoped task groups for stateless requests prevent zombie task accumulation when clients disconnect mid-request. Graceful drain on shutdown terminates all active transports before cancelling the task group, allowing SSE responses to close cleanly. Backported to v1.26.0 for compatibility with dbt-mcp and ai-codegen-api. Upstream PR: modelcontextprotocol#2145
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
run_stateless_servertasks accumulate indefinitely in the manager's global task group after client disconnects_task_groupfor stateless requests, ensuring tasks are cancelled when requests completeMotivation and Context
Related to #1764 (the underlying task accumulation issue in stateless mode).
Fixes #1272 (graceful shutdown / SSE drain).
Root cause — memory leak
When a client disconnects while a tool call is in progress,
EventSourceResponsedetects the disconnect and tears down the SSE stream pipeline. However,Server.run()— which is processing the tool call in a separate task spawned into the manager's global_task_group— continues running. After the tool handler completes, the response has nowhere to go (the request streams were cleaned up), andServer.run()blocks indefinitely onasync for message in session.incoming_messages. These zombie tasks accumulate one per disconnected request.In production, we observed 449 leaked
ServerSessionobjects, 66,086 orphaned coroutines, 4,463 blocked stream receive operations, and ~150 MB of leaked heap — all from this pattern.Root cause — broken SSE close on shutdown
The memory leak fix moves stateless server tasks from the manager's global task group to request-scoped task groups. This fixes the leak, but it also means the manager's
tg.cancel_scope.cancel()during shutdown no longer reaches stateless transports — nothing ever closes their streams, so SSE responses hang indefinitely. A reverse proxy like nginx would log "upstream prematurely closed connection while reading upstream".The graceful drain fix adds explicit
terminate()calls inrun()'s finally block (for both stateful and stateless transports) before cancelling the task group. This closes the in-memory streams, lettingEventSourceResponsesend a finalmore_body=Falsechunk — a clean HTTP close.Why the existing test didn't catch the leak
The existing test (
test_stateless_requests_memory_cleanup, added in PR #1140) mocksapp.runto return immediately, so the spawned tasks complete naturally. It never exercises the real-world scenario where client disconnects leaveapp.run()blocked.The fixes
Memory leak: Replace the global task group spawn with a request-scoped
anyio.create_task_group(). Both the server task and the request handler run inside this scoped group. When the request handler finishes (or is cancelled due to client disconnect), the scope is cancelled, which automatically cancels the server task. No tasks leak into the manager's long-lived task group.Graceful drain: Track in-flight stateless transports in
_stateless_transports. Inrun()'s finally block, callterminate()on all active transports (stateful via_server_instances, stateless via_stateless_transports) beforetg.cancel_scope.cancel().Testing
test_stateless_requests_task_leak_on_client_disconnect: RealServer.run()with a real tool handler, real SSE streaming, simulated client disconnect — verifies zero leaked tasks after 3 disconnected requeststest_graceful_shutdown_closes_sse_streams_cleanly: Uses an ASGIsend()interceptor to verify that SSE responses receivemore_body=False(the ASGI equivalent of a proper HTTP chunked transfer close) during manager shutdown. Without the fix, the test times out because nothing ever closes the stateless transport's streams.All 13 tests pass.
Breaking changes
None. This is an internal implementation change. The public API and behavior are unchanged.
Checklist